Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use cachetools's LRUCache to cache manifest list #1187

Merged
merged 7 commits into from
Sep 24, 2024

Conversation

kevinjqliu
Copy link
Contributor

@kevinjqliu kevinjqliu commented Sep 20, 2024

Closes #595, Reverts #787, Closes #1162

Reimplement manifest list cache using the cachetools library. This keeps the manifest list cache as a global cache and introduces the ability to specify the cache key as only manifest_list.

The cache result is stored as Tuple instead of List to prevent modification

Move the _manifests function from snapshots.py to manifest.py since its a global manifest cache and not specific to any 1 snapshot

@@ -620,6 +623,13 @@ def fetch_manifest_entry(self, io: FileIO, discard_deleted: bool = True) -> List
]


@cached(cache=LRUCache(maxsize=128), key=lambda io, manifest_list: hashkey(manifest_list))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the use of cachetools LRUCache, instead of functools.lru_cache solve the issues with the FileSystem on M1 Macs we were observing before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! the difference is that functools.lru_cache uses all of the function's args as the cache key, including the io arg.
Using cachetools, I can specify which argument to use as the cache key. In this case, only using manifest_list as the cache key

key=lambda io, manifest_list: hashkey(manifest_list)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's so clever - so the issue was with the function arguments being 'held up' by the cache preventing them from being GCed 🤯 🧠

This is next level @kevinjqliu 💯

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my best theory, haha. Regardless of the underlying cause, we should not be using io as the cache key!

@kevinjqliu kevinjqliu force-pushed the kevinjqliu/manifest-list-cachetools branch from 13fe7d6 to ade0fc5 Compare September 23, 2024 16:28
@sungwy sungwy merged commit 6b7f07a into apache:main Sep 24, 2024
8 checks passed
@sungwy sungwy added this to the PyIceberg 0.8.0 release milestone Sep 24, 2024
@kevinjqliu kevinjqliu deleted the kevinjqliu/manifest-list-cachetools branch September 24, 2024 15:48
@antonioalegria
Copy link

When is the release (0.8.0) planned for - would it be possible to release this fix on a 0.7.2 hotfix release? This solves #1162 bug that is blocking usage, unless there is a workaround. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug?] cannot run integration test Implement caching of manifest-files
3 participants